author: Raymond Yee (@rdhyee)
last edit: 2017.07.03
After expending considerable effort using the Wikipedia API to assemble a database of user edit metadata in support of the work reported in "Gender Differences in Wikipedia Editing" (a prize-winning short paper for WikiSym 2011), I have long been intrigued by the prospect of direct access to the database behind Wikipedia. With direct database access, there would be no need to accumulate data by polling the API, the laborious and error-prone process used for the paper. Just sign up for an account on Wikimedia Labs and connect to the live database replicas. It helps to understand the MediaWiki database layout and to study example MySQL queries that can be run on the replicas. Undoubtedly, there is a learning curve; there's a lot of documentation out there, but I've found it a chore to make sense of it all. (If you get confused, don't forget to ask for help on the Labs-l Mailing List or the Wikimedia Labs IRC channel.)
In my self-education about the Wikipedia databases, I've been particularly happy to find Quarry, a web application (currently in beta), that allows one to "[r]un SQL queries against Wikipedia & other databases from your browser". You can run SQL queries and see the results in the browser, download results, and also easily share your queries and results with others. You can also see what queries other people have run, from which you can presumably learn much. (Finally, you don't need a shell account on Wikimedia Labs to use the service.)
I've been enjoying Quarry a lot, but I wanted to integrate its functionality with other software. Specifically, I wanted to integrate work I'm doing on Quarry with the IPython/Jupyter notebook. To begin with, it would be convenient to be able to programmatically access the output from queries. It turns out that you can make use of an implicit Quarry API to read not only the output data of a query but also associated metadata, including the query SQL itself, the author, title, and date of execution. Some things I describe here:
Once I began to write integration code for Quarry and to surface my own work there, I started to explore what other people are doing on Quarry. Such questions as:
This notebook shows some preliminary work in that area.
BTW, don't miss the main documentation page for Quarry and the three-part Wikiresearch webinar series, which includes videos (1, 2, 3) with some coverage of using Quarry.
In [ ]:
%matplotlib inline
In [ ]:
from __future__ import print_function
In [ ]:
import datetime
from itertools import (islice, chain)
import re
import time
from collections import (Counter, OrderedDict)
# writing for eventual Python 2/3 compatibility
try:
    from urllib.parse import urlencode
except ImportError:
    from urllib import urlencode
import requests
from lxml.html import fromstring, parse
import matplotlib.pyplot as plt
from pandas import (DataFrame, Series)
from IPython.display import (display, HTML, Markdown)
As a big fan of the music of J. S. Bach, I'm also very interested in the history of the English Wikipedia page for Johann Sebastian Bach. How did the article develop over the years? When was it most actively edited?
As a simple first question to answer using Quarry, I wanted to compute the number of revisions by year for Johann_Sebastian_Bach in enwiki.
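Here is a sketch of roughly what such SQL looks like against the MediaWiki schema. This is my approximation for illustration, not necessarily the exact text of query 3659, which is retrieved and displayed below.
In [ ]:
# a sketch (my approximation) of the kind of SQL one refines on Quarry for this
# question -- not necessarily the exact text of query 3659, which is displayed below
bach_revisions_by_year_sql = """
SELECT LEFT(rev_timestamp, 4) AS year, COUNT(*) AS count
FROM revision
JOIN page ON rev_page = page_id
WHERE page_namespace = 0
  AND page_title = 'Johann_Sebastian_Bach'
GROUP BY LEFT(rev_timestamp, 4)
"""
(rev_timestamp is stored as a string of the form YYYYMMDDHHMMSS, so its first four characters give the year.)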
The easiest workflow to begin with is to refine a query on Quarry and then download the result set. That is, for a recent run, I could analyze the output by first downloading the JSON-formatted data from http://quarry.wmflabs.org/run/24183/output/0/json?download=true. But if I were to run the query again, the URL for the output would change, because the run number changes.
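For instance, the output of that specific run can be fetched directly with requests. This is a minimal sketch with the run id 24183 baked into the URL, which is exactly the fragility just described:
In [ ]:
# fetch the JSON output of one specific Quarry run
# (run 24183 is the example above; rerunning the query yields a new run id,
#  so this hard-coded URL goes stale)
run_url = "http://quarry.wmflabs.org/run/24183/output/0/json?download=true"
run_output = requests.get(run_url).json()
run_output['rows'][:5]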
Rather than hard-coding run URLs (or manually downloading the output of a Quarry query and the accompanying metadata), it is ultimately easier to obtain that output programmatically, starting from the query id.
Through a combination of reverse engineering Quarry, greatly helped by the open availability of its source code, I wrote the following Python functions to obtain the output of a query and to display that output in a Jupyter notebook.
In [ ]:
def quarry_result(query_id, rev_id=None, result_format='json'):
    """
    returns a dict {'status', 'output', 'query_meta', 'latest_run', 'latest_rev'}
    for a query_id on quarry.wmflabs.org
    """
    BASE_URL = "http://quarry.wmflabs.org"

    # get the metadata for the query
    # https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/app.py#L402
    url = BASE_URL + "/query/{0}/meta".format(query_id)

    try:
        query_meta = requests.get(url).json()

        status = query_meta['latest_run']['status']
        latest_run_id = query_meta['latest_run']['id']
        latest_rev_id = query_meta['latest_rev']['id']

        # if the run is complete, grab the results -- unless result_format is None
        # https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/app.py#L350
        if status == 'complete' and result_format is not None:
            # TO DO: figure out whether 0 should be a variable
            # re: https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/app.py#L351
            # 0 is the default value for resultset_id
            url = BASE_URL + "/run/{0}/output/0/{1}".format(latest_run_id, result_format)
            if result_format == 'json':
                output = requests.get(url).json()
            else:
                output = requests.get(url).text
        else:
            output = None

        return {'query_meta': query_meta,
                'status': status,
                'latest_run': latest_run_id,
                'latest_rev': latest_rev_id,
                'output': output}
    except Exception as e:
        # return (rather than raise) the exception so that callers can filter
        # out problematic queries with isinstance(result, Exception)
        return e


def display_objs_for_q_result(q):
    """
    returns IPython/Jupyter display objects to describe query metadata and SQL content
    (first pass)
    """
    description = q['query_meta']['query']['description']
    if description is None:
        description = ""

    return (HTML("<b>{0}</b>".format(q['query_meta']['query']['title'])),
            HTML("<p>id: {0} ({1})</p>".format(q['query_meta']['query']['id'],
                                               q['query_meta']['query']['timestamp'])),
            HTML("<p>{0}</p>".format(description)),
            Markdown("""```sql\n{0}\n```""".format(q['query_meta']['latest_rev']['sql'])))
Now let's retrieve and display the metadata for query 3659: http://quarry.wmflabs.org/query/3659
In [ ]:
q = quarry_result(3659, result_format='json')
display(*display_objs_for_q_result(q))
Show the output from the query
In [ ]:
q['output']
Put the output into a pandas DataFrame
In [ ]:
df = DataFrame(q['output']['rows'], columns=['year', 'count'])
df.set_index(keys='year', inplace=True, drop=True)
df
Make a bar chart of the revision counts for the Bach article by year
In [ ]:
df.plot(kind='bar', title="revision counts for JSB article vs year")
This chart alone brings up many questions for future exploration. Why is there so much activity between 2005 and 2007, followed by a dramatic decrease in 2008? Did authorship activity decline for the Johann Sebastian Bach article alone, or was there movement in authorship to related articles?
The query I wrote about Bach was only one of the queries I wrote on Quarry. Once I could programmatically retrieve a single query, I wanted to interact with all the queries that I had created. To that end, I wrote user_queries to retrieve all the queries for a given user by scraping that user's profile page. For example, http://quarry.wmflabs.org/RaymondYee has all the queries for the user RaymondYee.
(I was hoping that the queries would be available in some type of JSON format, but I couldn't find such a source. The source code for Quarry confirms my suspicion that I need to scrape the profile page for a given user.)
In [ ]:
# types of queries
query_type_headers = OrderedDict([
    ('published', 'Published queries'),
    ('draft', 'Draft Queries'),
    ('starred', 'Starred Queries')
])


def user_queries(username):
    """
    get all queries for a user
    e.g., parse http://quarry.wmflabs.org/RaymondYee
    """
    url = u"http://quarry.wmflabs.org/{0}".format(username)
    page = requests.get(url).content.decode("UTF-8")
    doc = fromstring(page)

    # xpath expressions correlate with the user-page template
    # https://github.com/wikimedia/analytics-quarry-web/blob/e2346c5ec47d63f9514b8aef9190211fa5ff0e45/quarry/web/templates/user.html
    queries = dict()

    # number of queries
    queries['num_queries'] = int(doc.xpath('//*[@class="user-stat"]/h2/text()')[0])

    # loop through all the query types
    for (qtype, qheader) in query_type_headers.items():
        q_elements = doc.xpath('//h3[contains(text(),"{0}")][1]/following-sibling::ul[1]/li/a'.format(qheader))
        q_results = []
        for q in q_elements:
            q_id = int(q.attrib['href'].split('/')[-1])
            q_results.append((q_id, q.text))
        queries[qtype] = q_results

    return queries
Let's now retrieve my queries
In [ ]:
uq = user_queries('RaymondYee')
uq
With a bit more work, we can make it easier to jump back to the original queries on Quarry.
In [ ]:
queries_display = ""
queries_display += "<p>number of queries: {0}</p>".format(uq['num_queries'] )
for (qtype, qheader) in query_type_headers.items():
queries_display += "<b>{0}</b><br/>".format(qheader)
for (q_id, q_title) in uq[qtype]:
queries_display += "<div>{0} <a href='http://quarry.wmflabs.org/query/{0}'>{1}</a></div>".format(q_id, q_title)
queries_display += "<br/>"
HTML(queries_display)
We can print out all the SQL for the queries too. Here's the description and SQL for my published and draft queries.
In [ ]:
ry_results = [quarry_result(q_id, result_format=None)
              for q_id in chain([q_id for (q_id, title) in uq['published']],
                                [q_id for (q_id, title) in uq['draft']])]

# the following way to display the results is a bit opaque and should be rewritten
display(*list(chain(*[display_objs_for_q_result(r)
                      for r in ry_results if not isinstance(r, Exception)])))
In [ ]:
# problematic queries, i.e., those for which quarry_result returned an exception
# (their metadata lives at https://quarry.wmflabs.org/query/{q_id}/meta)
try:
    from itertools import izip as zip
except ImportError:
    pass

q_ids = list(chain([q_id for (q_id, title) in uq['published']],
                   [q_id for (q_id, title) in uq['draft']]))

[id_ for (id_, r) in zip(q_ids, ry_results) if isinstance(r, Exception)]
One of the next steps I'd like to pursue is figuring out how to programmatically write to Quarry, for example, to formulate queries in Python and then send them to Quarry. As a Wikimedia Labs user, I'd also want to be able to move queries between Quarry and the job submission facility on Labs.
A major part of the appeal of Quarry is that you are part of a community of users creating queries. I personally wanted to understand what others are doing on Quarry.
To satisfy my curiosity, I decided to scrape the Recent Queries page for a list of queries and their creators by writing runs_list. The function runs_list loops through all the queries by paging through the entire history available on Recent Queries.
In [ ]:
def anchor_info_or_text(td):
    anchors = td.xpath('a')
    if anchors:
        # if the anchor href is of the form "/query/query_id", return just query_id
        href = anchors[0].attrib['href']
        query_search = re.search(r"/query/(\d+)", href)
        if query_search:
            return (int(query_search.group(1)), anchors[0].text)
        else:
            return (href, anchors[0].text)
    else:
        return td.text


def filter_none(d):
    """
    return dict d without any items with None for value
    """
    return dict([(k, v) for (k, v) in d.items() if v is not None])


def runs_list(limit=None, from_=None, _sleep=0):
    """
    Generator for all the queries on http://quarry.wmflabs.org/query/runs/all
    """
    url = ("http://quarry.wmflabs.org/query/runs/all?" +
           urlencode(filter_none({'from': from_, 'limit': limit})))

    more_pages = True
    while more_pages:
        page = requests.get(url).content.decode("UTF-8")
        doc = fromstring(page)

        # grab the table headers (not used below, but handy for debugging)
        headers = [th.text for th in doc.xpath("//th")]

        # yield rows
        for tr in doc.xpath("//tr[td]"):
            yield [anchor_info_or_text(td) for td in tr]

        # follow the "next" link, if any
        next_links = doc.xpath('//li[@class="next"]/a')
        if next_links:
            url = ("http://quarry.wmflabs.org/query/runs/all?" +
                   next_links[0].attrib['href'])
        else:
            more_pages = False

        time.sleep(_sleep)
Let's scrape the pages. As of 2015-05-30, there are few enough queries for me to practically get all the pages returned by Quarry.
In [ ]:
print ("time of analysis:", datetime.datetime.utcnow().isoformat(), "\n")
queries = []
# loop and print out which row we're on
for (i, item) in enumerate(islice(runs_list(_sleep=0.5), None)):
print("\r {0}".format(i), end="")
queries.append( (item[0][0], item[0][1], item[1][1], item[2], item[3]))
Display the last five results
In [ ]:
df = DataFrame(queries, columns=['id', 'title', 'creator','status', 'time'])
df.tail()
How many query records belong to me?
In [ ]:
len(df[df.creator=='RaymondYee'])
How many unique query identifiers and creators are there in the data set?
In [ ]:
# how many unique ids
len(df.id.unique()), len(df.creator.unique())
I am surprised by such a small number of unique queries. It's possible my code has a bug that causes it to miss many queries. Alternatively, Quarry might return only a limited selection. (Something to figure out.)
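As a quick sanity check (a sketch, not a definitive diagnosis), we can look at how often the same query id recurs in the scraped runs, i.e., whether the listing is dominated by repeated runs of a smaller set of queries:
In [ ]:
# how many times does each query id appear among the scraped runs?
run_counts = df.id.value_counts()
print(run_counts.head(10))  # query ids with the most recorded runs
print((run_counts > 1).sum(), "query ids appear more than once")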
In [ ]:
# loop through users to accumulate data on all user queries,
# on the way to computing the most-starred queries
queries_by_user = dict()

for (i, username) in enumerate(islice(df.creator.unique(), None)):
    print("\r {0}".format(i), end="")
    queries_by_user[username] = user_queries(username)
Now let's compute the queries that have been starred by others and the number of stars they have received.
In [ ]:
# let's look at what has been starred, and how often
starred_q = Counter()

for (u, d) in queries_by_user.items():
    for q in d['starred']:
        starred_q.update([q])

starred_q_display = ""
for (q, count) in starred_q.most_common():
    starred_q_display += (u"<div><a href='http://quarry.wmflabs.org/query/{0}'>"
                          u"{1}</a>: {2}</div>").format(q[0], q[1], count)

HTML(starred_q_display)
There are a lot of other aspects of Quarry to explore, not to mention the Wikimedia databases themselves! Some ideas are:
In [ ]: